AIs have a big problem with truth and correctness – and human thinking appears to be a big part of that problem. A new generation of AI is now starting to take a much more experimental approach that could catapult machine learning way past humans.
Remember Deepmind's AlphaGo? It represented a fundamental breakthrough in AI development, because it was one of the first game-playing AIs that took no human instruction and read no rules.
Instead, it used a technique called self-play reinforcement learning (RL) to build up its own understanding of the game. Pure trial and error across millions, even billions of virtual games, starting out more or less randomly pulling whatever levers were available, and attempting to learn from the results.
Within two years of the start of the project in 2014, AlphaGo had beaten the European Go champion 5-0 – and by 2017 it had defeated the world's #1 ranked human player.
At this point, Deepmind unleashed a similar AlphaZero model on the chess world, where models like Deep Blue, trained on human thinking, knowledge and rule sets, had been beating human grandmasters since the 90s. AlphaZero played 100 matches against the reigning AI champion, Stockfish, winning 28 and tying the rest.
Human thinking puts the brakes on AI
Deepmind started dominating these games – and shoji, Dota 2, Starcraft II and many others – when it jettisoned the idea that emulating a human was the best way to get a good result.
Bound by different limits than us, and gifted with different talents, these electronic minds were given the freedom to interact with things on their own terms, play to their own cognitive strengths, and build their own ground-up understanding of what works and what doesn't.
AlphaZero doesn't know chess like Magnus Carlssen does. It's never heard of the Queen's Gambit or studied the great grandmasters. It's just played a shit-ton of chess, and built up its own understanding against the cold, hard logic of wins and losses, in an inhuman and inscrutable language it created itself as it went.
You can tell the RL is done properly when the models cease to speak English in their chain of thought
— Andrej Karpathy (@karpathy) September 16, 2024
As a result it's so much better than any model trained by humans, that it's an absolute certainty: no human, and no model trained on human thinking will ever again have a chance in a chess game if there's an advanced reinforcement learning agent on the other side.
And something similar, according to people that are better-placed to know the truth than anyone else on the planet, is what's just started happening with the latest, greatest version of ChatGPT.
OpenAI's new o1 model begins to diverge from human thinking
ChatGPT and other Large Language Model (LLM) AIs, like those early chess AIs, has been trained on as much human knowledge as was available: the entire written output of our species, give or take.
And they've become very, very good. All this palaver about whether they'll ever achieve Artificial General Intelligence... Good grief, can you picture a human that could compete with GPT-4o across the breadth of its capabilities?
But LLMs specialize in language, not in getting facts right or wrong. That's why they "hallucinate" – or BS – giving you wrong information in beautifully phrased sentences, sounding as confident as a news anchor.
Language is a collection of weird gray areas where there's rarely an answer that's 100% right or wrong – so LLMs are typically trained using reinforcement learning with human feedback. That is, humans pick which answers sound closer to the kind of answer they were wanting. But facts, and exams, and coding – these things do have a clear success/fail condition; either you got it right, or you didn't.
And this is where the new o1 model has started to split away from human thinking and start bringing in that insanely effective AlphaGo approach of pure trial and error in pursuit of the right result.
o1's baby steps into reinforcement learning
In many ways, o1 is pretty much the same as its predecessors – except that OpenAI has built in some 'thinking time' before it starts to answer a prompt. During this thinking time, o1 generates a 'chain of thought' in which it considers and reasons its way through a problem.
And this is where the RL approach comes in – o1, unlike previous models that were more like the world's most advanced autocomplete systems, really 'cares' whether it gets things right or wrong. And through part of its training, this model was given the freedom to approach problems with a random trial-and-error approach in its chain of thought reasoning.
It still only had human-generated reasoning steps to draw from, but it was free to apply them randomly and draw its own conclusions about which steps, in which order, are most likely to get it toward a correct answer.
And in that sense, it's the first LLM that's really starting to create that strange, but super-effective AlphaGo-style 'understanding' of problem spaces. In the domains where it's now surpassing Ph.D.-level capabilities and knowledge, it got there essentially by trial and error, by chancing upon the correct answers over millions of self-generated attempts, and by building up its own theories of what's a useful reasoning step and what's not.
So in topics where there's a clear right and wrong answer, we're now beginning to see this alien intelligence take the first steps past us on its own two feet. If the games world is a good analogy for real life, then friends, we know where things go from here. It's a sprinter that'll accelerate forever, given enough energy.
But o1 is still primarily trained on human language. That's very different from truth – language is a crude and low-res representation of reality. Put it this way: you can describe a biscuit to me all day long, but I won't have tasted it.
So what happens when you stop describing the truth of the physical world, and let the AIs go and eat some biscuits? We'll soon begin to find out, because AIs embedded in robot bodies are now starting to build their own ground-up understanding of how the physical world works.
AI's pathway toward ultimate truth
Freed from the crude human musings of Newton, and Einstein, and Hawking, embodied AIs will take a bizarre AlphaGo-style approach to understanding the world. They'll poke and prod at reality, and observe the results, and build up their own theories in their own languages about what works, what doesn't, and why.
They won't approach reality like humans or animals do. They won't use a scientific method like ours, or split things into disciplines like physics and chemistry, or run the same kinds of experiments that helped humans master the materials and forces and energy sources around them and dominate the world.
Embodied AIs given the freedom to learn like this will be hilariously weird. They'll do the most bizarre things you can think of, for reasons known only to themselves, and in doing so, they'll create and discover new knowledge that humans could never have pieced together.
Unshackled from our language and thinking, they won't even notice when they break through the boundaries of our knowledge and discover truths about the universe and new technologies that humans wouldn't stumble across in a billion years.
We're granted some reprieve here; this isn't happening in a matter of days or weeks, like so much of what's going on in the LLM world.
Reality is the highest-resolution system we know of, and the ultimate source of truth. But there's an awful lot of it, and it's also painfully slow to work with; unlike in simulation, reality demands that you operate at a painfully slow one minute per minute, and you're only allowed to use as many bodies as you've actually built.
So embodied AIs attempting to learn from base reality won't initially have the wild speed advantage of their language-based forebears. But they'll still be a lot faster than evolution, with the ability to pool their learnings among co-operative groups in swarm learning.
Companies like Tesla, Figure and Sanctuary AI are working feverishly at building humanoids to a standard that's commercially useful and cost-competitive with human labor. Once they achieve that - if they achieve that - they'll be able to build enough robots to start working on that ground-up, trial-and-error understanding of the physical world, at scale and at speed.
They'll need to pay their way, though. It's funny to think about, but these humanoids might learn to master the universe in their downtime from work.
Apologies for these rather esoteric and speculative thoughts, but as I keep finding myself saying, what a time to be alive!
OpenAI's o1 model might not look like a quantum leap forward, sitting there in GPT's drab textual clothing, looking like just another invisible terminal typist. But it really is a step-change in the development of AI – and a fleeting glimpse into exactly how these alien machines will eventually overtake humans in every conceivable way.
For a wonderful deeper dive into how reinforcement learning makes o1 a step-change in the development of AI, I highly recommend the video below, from the excellent AI Explained channel.
Source: OpenAI / AI Explained